Harry Potter - An Analysis on the Moods

Introduction

As the famous saying goes, “there are a thousand Hamlets in a thousand people’s eyes.” This project examines that idea through the eyes of modern technology: what sentiment emanates from the text of the Harry Potter series by J.K. Rowling?

Questions to answer include, but are not limited to: Does the sentiment analysis on single words tell the true story? How does sentiment change as readers (in this case, RStudio and certain lexicon packages) progress through the chapters in each book? If the sentiment is analyzed on a bigram basis, how does the overall result change? What about analyzing on a sentence basis?

Data Source

The data for this project comes from a harrypotter package found on GitHub. It contains the text of the seven novels in the Harry Potter series written by J.K. Rowling:
1. philosophers_stone: Harry Potter and The Philosopher’s Stone (1997)
2. chamber_of_secrets: Harry Potter and The Chamber of Secrets (1998)
3. prisoner_of_azkaban: Harry Potter and The Prisoner of Azkaban (1999)
4. goblet_of_fire: Harry Potter and The Goblet of Fire (2000)
5. order_of_the_phoenix: Harry Potter and The Order of the Phoenix (2003)
6. half_blood_prince: Harry Potter and The Half Blood Prince (2005)
7. deathly_hallows: Harry Potter and The Deathly Hallows (2007)

Data Preparation

The data is fairly clean by nature, but since sentiment analysis will be conducted, a few cleaning steps are taken to better prepare the dataset.

Order of the Books

Since later in this analysis we examine how sentiments and moods evolve as we progress through the chapters, making sure the books are listed in publication order is rather important.

Term Frequency

Term frequency is a fairly important concept for identifying the terms that occur most often. Certain insights can be drawn from this result: if a word appears more often, it may have more leverage in shaping the final sentiment scores.

Overall

Start by getting a basic idea of the most frequently used words. “Looked”, “time”, and “wand” (undoubtedly) are among the most frequently used words across all seven books.
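A minimal sketch of how such a count can be produced with tidytext. The object name `hp` is an assumption: a tibble with one row per chapter and a `text` column, built earlier from the harrypotter package.

```r
library(dplyr)
library(tidytext)

# `hp` is assumed: one row per chapter, with book, chapter, and text columns
hp %>%
  unnest_tokens(word, text) %>%            # split text into one word per row
  anti_join(stop_words, by = "word") %>%   # drop common function words
  count(word, sort = TRUE) %>%             # term frequency across all books
  head(10)
```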

By Book

Compute the term frequencies of words and show the top 10 words by book. “Time”, “looked”, and “eyes” show up in almost all of the books.

Term Frequency - Inverse Document Frequency (Unigrams)

tf-idf shows how important a word is to a document. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
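The definition can be made concrete with a toy example in base R, mirroring what tidytext’s bind_tf_idf() computes: tf = count / words in the document, idf = ln(number of documents / documents containing the term). The two tiny “books” below are made up purely for illustration.

```r
# Toy tf-idf in base R
docs <- list(
  book1 = c("harry", "wand", "wand", "owl"),
  book2 = c("harry", "horcrux")
)
n_docs <- length(docs)
# in how many documents each term appears
doc_freq <- table(unlist(lapply(docs, unique)))

tf_idf <- lapply(docs, function(words) {
  tf  <- table(words) / length(words)                      # term frequency
  idf <- log(n_docs / as.numeric(doc_freq[names(tf)]))     # inverse doc freq
  tf * idf
})

tf_idf$book1[["wand"]]   # "wand" appears only in book1, so tf-idf > 0
tf_idf$book1[["harry"]]  # "harry" appears in both books, so idf = 0
```

This is why the words flagged as “important” by tf-idf are context-specific: words shared by every book get an idf of zero.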

By Book

Compute the tf-idf table by book. The words deemed “important” are more context-specific.

Bigrams

Words should also be viewed collectively, because in literature many words are meaningless if analyzed individually. Thus, it might provide deeper insight to see which word combinations are used most frequently. Investigating bigrams is a good start.

Most Frequently Used

The most frequently used bigrams are mostly context-related.
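A sketch of how the bigram counts can be produced, again assuming the text sits in a tibble `hp` with a `text` column (the name is illustrative):

```r
library(dplyr)
library(tidytext)

# Tokenize into overlapping two-word sequences and count them
hp %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)
```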

Term Frequency - Inverse Document Frequency (Bigrams)


A Table

From the tf-idf table below, we can see the top 10 most “important” bigrams by book. “Death eater” seems to be pretty important to four of the seven books in the series, echoing its strong presence when the text of all seven books is analyzed as a whole.

## # A tibble: 340,021 x 2
##    bigram         n
##    <chr>      <int>
##  1 of the      4895
##  2 in the      3571
##  3 said harry  2626
##  4 he was      2490
##  5 at the      2435
##  6 to the      2386
##  7 on the      2359
##  8 he had      2138
##  9 it was      2123
## 10 out of      1911
## # … with 340,011 more rows
## # A tibble: 70,371 x 2
##    bigram                 n
##    <chr>              <int>
##  1 uncle vernon         386
##  2 death eaters         346
##  3 invisibility cloak   192
##  4 dark arts            176
##  5 death eater          164
##  6 entrance hall        145
##  7 daily prophet        125
##  8 mad eye              116
##  9 hospital wing        107
## 10 madame maxime         94
## # … with 70,361 more rows

Sentiment of Unigrams

Three lexicons are contained in the sentiments dataset of the tidytext package: AFINN, bing, and nrc. All three are based on unigrams, and each assigns single words sentiment scores according to its own categorization. This section goes through each of the three lexicons.

nrc Lexicon

The nrc lexicon categorizes words in a binary fashion (0 or 1, i.e., no or yes) into ten categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

According to the nrc lexicon, most words are classified as “negative”, with “positive” as the runner-up at a gap of roughly 5500 words. Then come “fear” and “trust”, about 2000 words shy of that. In general, there is a stronger negative presence than positive.
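A sketch of the category tally, assuming `hp_words` is a tibble with one row per token and a `word` column (the name is illustrative):

```r
library(dplyr)
library(tidytext)

# A word can belong to several nrc categories, so the join can multiply rows
hp_words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(sentiment, sort = TRUE)
```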

AFINN Lexicon

The AFINN lexicon maps words with a score ranging from -5 to 5. A negative score indicates negative sentiment and a positive score indicates positive sentiment.

From this result, we can see that the words with the highest absolute sentiment values are mostly negative.

bing Lexicon

The bing lexicon assigns words in a binary fashion into positive and negative.

Overall
We can see that according to “bing”, around 68% of all matched words are categorized as negative.

## Joining, by = "word"
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative  38973
## 2 positive  17938
## Joining, by = "word"

By Book
According to “bing”, generate how the net sentiment changes as readers progress through each book. For a better output, break each book into chunks of 500 words (the approximate number of words on every two pages) and index them accordingly.

The sentiment shows considerable variation in six of the seven books; the exception is Order of the Phoenix, which stays more consistently negative throughout.

Comparison

To see how the AFINN, bing, and nrc lexicons lead to different sentiment results, it is helpful to benchmark them against each other. As before, each book is broken into 500-word chunks (roughly two pages each) and indexed accordingly. The result shows how each lexicon shapes the sentiment scores per book, to investigate questions like: does any book receive similar sentiment scores from all three lexicons?

From the result, we can see that nrc produces different sentiment scores, while AFINN and bing show a more similar pattern in categorizing words. However, how much coincidence factors into this result needs further investigation.

sentiafinn <- hp %>%
        group_by(book) %>% 
        mutate(word_count = 1:n(),
               index = word_count %/% 500 + 1) %>% 
        inner_join(get_sentiments("afinn")) %>%
        group_by(book, index) %>%
        summarise(sentiment = sum(value)) %>%
        mutate(method = "AFINN")

  
sentibingnrc <- bind_rows(hp %>%
                  group_by(book) %>% 
                  mutate(word_count = 1:n(),
                         index = word_count %/% 500 + 1) %>% 
                  inner_join(get_sentiments("bing")) %>%
                  mutate(method = "bing"),
                  hp %>%
                  group_by(book) %>% 
                  mutate(word_count = 1:n(),
                         index = word_count %/% 500 + 1) %>%
                  inner_join(get_sentiments("nrc") %>%
                                     filter(sentiment %in% c("positive", "negative"))) %>%
                  mutate(method = "nrc")) %>%
        count(book, method, index = index , sentiment) %>%
        ungroup() %>%
        spread(sentiment, n, fill = 0) %>%
        mutate(sentiment = positive - negative) %>%
        select(book, index, method, sentiment)

bind_rows(sentiafinn, 
          sentibingnrc) %>%
        ungroup() %>%
        mutate(book = factor(book, levels = titles)) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_grid(book ~ method) +
  scale_fill_manual(labels = letters[1:3], values = c("coral3", "darkgoldenrod1","aquamarine4"))+
  labs(title = "Sentiment Score by Different Lexicons", subtitle = "by book, grouped by every 500 words", x = "(#words/500)", y = "sentiment score")

Sentiment of Bigrams

Sentiment analysis can be misleading when a negation word appears right before a word that is mapped to a sentiment/emotion lexicon entry. Common negation words include, but are not limited to, “no”, “not”, “doesn’t”, “isn’t”, “wasn’t”, “can’t”, “hardly”, “barely”, “won’t”, “don’t”, “shouldn’t”, “wouldn’t”, and “couldn’t”. Thus, the following analysis checks whether words scored earlier in the single-word sentiment analysis might actually carry sentiment in the opposite direction.

A Table

The table shows the most used bigrams that start with a negation word.

The (Probably) Falsely Categorized Sentiment

Now, incorporate the bing lexicon again. Just from scanning the first few pages of the table, we can see that the second words most often following a negation word are categorized as negative. This may partly explain why the negative sentiment of single words so heavily outweighs the positive.
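A sketch of this check, assuming a `hp_bigrams` tibble with a space-separated `bigram` column (names and the negation list are illustrative):

```r
library(dplyr)
library(tidyr)
library(tidytext)

negation_words <- c("no", "not", "never", "won't", "don't")  # subset for illustration

hp_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%       # split the pair
  filter(word1 %in% negation_words) %>%                      # keep negated pairs
  inner_join(get_sentiments("bing"), by = c("word2" = "word")) %>%
  count(word1, word2, sentiment, sort = TRUE)
```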

Sentiment of Sentences

Though tokenizing at the single-word level might be useful in some scenarios, conducting sentiment analysis on larger units of text might be more applicable in the literary world. The sentimentr package provides algorithms that perform sentiment analysis at the sentence level. The following results show, for each book, the net sentiment by chapter and then by sentence. The analysis is based on AFINN since its sentiment score has a well-defined range.
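A minimal sketch of the sentence-level approach with sentimentr; `philosophers_stone` is assumed to be the character vector of chapters from the harrypotter package:

```r
library(sentimentr)

# Split each chapter into sentences, then average sentence-level sentiment
# back up to the chapter (one row per original element)
sentences <- get_sentences(philosophers_stone)
sentiment_by(sentences)
```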

Sentiment vs. Book Sales

U.S. First Print Run

Sales from the first print run have a lot to do with the series’ burgeoning popularity over time. Thus, it is fair to compare first-print-run sales among the first three books, and then among the last four. As the sentiment does not really differ among the first three, no pattern or association is visible there. Among Goblet of Fire, Order of the Phoenix, and Half-Blood Prince, sentiment seems to have some effect on sales: the higher (more positive) the sentiment score, the higher the sales.

Another way of looking at first-print-run sales: does the sentiment score of the previous book affect the sales of the following one? For example, it might be the positivity shown in Half-Blood Prince that drove the relatively high sales of Deathly Hallows. But this might also be due to Deathly Hallows being the highly anticipated last book in the series.

UK First Print Run

In the UK, it might be that Goblet of Fire’s greater positivity lifted the sales of Order of the Phoenix. Again, this explanation might be a bit far-fetched since, in real life, many factors affect the sales of a given book in a series.

Sentiment vs. Box Office Worldwide

Note that each book corresponds to one movie, except Harry Potter and the Deathly Hallows, which was split into part 1 and part 2. Thus, for this section, the box office for the last book is calculated as the average of part 1 (960.4 million USD) and part 2 (1.342 billion USD), which is 1.1512 billion USD.
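The averaging is trivial but worth making explicit:

```r
# Deathly Hallows box office, averaged across the two films (millions of USD)
part1 <- 960.4
part2 <- 1342.0
mean(c(part1, part2))  # 1151.2 million, i.e. ~1.1512 billion USD
```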

The result (sort of) shows a relationship: the higher (more positive) the average AFINN sentiment of a book (even though every book’s average sentiment is actually negative), the higher the box office. However, other factors need to be taken into account; for instance, the first and last movies are likely to draw a higher box office regardless.

Sentiment vs. Film Rating

Negativity in a book’s text might indicate frequent strong language, violence, drug abuse, nudity, etc. in the corresponding movie. Again, since both part 1 and part 2 of Deathly Hallows were given the same film rating, the two movies are combined into one row of entry here.

Here, we can see that the sentiment of the book text is loosely associated with the film ratings of the corresponding movies. But this might be because the average sentiment scores barely differ across the seven books (and thus the eight movies). Also, the lines and scripts of the movies are not necessarily the same as the books’ text; thus, using only the sentiment of the book text to explain the film ratings might not be an entirely logical approach.

Sentiment vs. Rating (IMDb, Rotten Tomatoes)

It might be interesting to see if there is any relationship between the sentiment of the book text and the ratings for the corresponding movies, since more negativity might actually draw a larger audience (it has been something of a trend for a while). For the last book there are two movies: part 1 gets 7.7/10 on IMDb and 77% on Rotten Tomatoes; part 2 gets 8.1/10 on IMDb and 96% on Rotten Tomatoes. To map the book to “one” movie, the numbers used here are the averages of the ratings for part 1 and part 2.
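Making the averaging explicit (these are the values patched into the table in the code below):

```r
# Deathly Hallows ratings, averaging part 1 and part 2 on a 0-10 scale
imdb    <- mean(c(7.7, 8.1))        # 7.9
rotten  <- mean(c(77, 96)) / 10     # 86.5% -> 8.65
avg_two <- mean(c(imdb, rotten))    # 8.275
```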

From this plot, we can see that the online ratings for the movies show no clear correlation with the sentiment of the corresponding books.

hprating <- read.csv("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/HPRating.csv")

names(hprating)[names(hprating)=="movie"] <- "book"

hpratingsenti <- hprating %>%
  right_join(sentiafinn2) 

# Manually patch rows whose values were missing after the join: ratings
# averaged across Deathly Hallows parts 1 and 2, plus release dates
hpratingsenti[1,2] = "7.9"
hpratingsenti[2,2] = "7.6"
hpratingsenti[4,2] = "7.7"
hpratingsenti[1,3] = "8.65"
hpratingsenti[2,3] = "8.3"
hpratingsenti[4,3] = "8.8"
hpratingsenti[1,4] = "8.275"
hpratingsenti[2,4] = "7.95"
hpratingsenti[4,4] = "8.25"
hpratingsenti[1,5] = "19-Nov-10"
hpratingsenti[2,5] = "15-Jul-09"
hpratingsenti[4,5] = "18-Nov-05"

hpratingsenti$imdb <- as.numeric(hpratingsenti$imdb)
hpratingsenti$rotten_tomatoes <- as.numeric(hpratingsenti$rotten_tomatoes)
hpratingsenti$avg_two <- as.numeric(hpratingsenti$avg_two)
hpratingsenti$sentiment <- as.numeric(hpratingsenti$sentiment)
hpratingsenti$release_date <- as.Date(hpratingsenti$release_date,  "%d-%B-%y")

ggplot(hpratingsenti)+
  geom_line(aes(x = release_date, y = imdb), color = "azure3") +
  geom_line(aes(x = release_date, y = rotten_tomatoes), color = "coral2") +
  geom_line(aes(x = release_date, y = avg_two), color = "cadetblue3") +
  geom_point(mapping = aes(release_date, imdb, color = sentiment)) +
  geom_point(mapping = aes(release_date, rotten_tomatoes, color = sentiment)) +
  geom_point(mapping = aes(release_date, avg_two, color = sentiment)) +
  scale_color_gradient(low = "gold4", high = "yellow") +
  theme_bw() +
  labs(title = "Sentiment vs. Online Ratings for Corresponding HP Movie", x = "Release Date", y = "Ratings(out of 10)")

Topic Modeling

Preparation Work

Preparation Work (cont.)

## <<TermDocumentMatrix (terms: 19196, documents: 200)>>
## Non-/sparse entries: 209841/3629359
## Sparsity           : 95%
## Maximal term length: 49
## Weighting          : term frequency - inverse document frequency (tf-idf)
## Sample             :
##           Docs
## Terms             107       120        125       126       171         186
##   —          0.000000  0.000000  0.0000000  0.000000  0.000000   0.0000000
##   ¨c         0.000000  0.000000  0.0000000  0.000000 92.507307 129.0233497
##   dobby      0.000000  0.000000  0.0000000  0.000000  0.000000  32.2759186
##   hagrid     1.449986 16.674838 83.3741895 20.299803  2.899972   0.0000000
##   lupin      1.415037  0.000000  0.0000000  2.830075  5.660150   0.0000000
##   sirius     2.028999  6.086997  3.0434987  2.028999  0.000000   0.0000000
##   slughorn   0.000000  0.000000  0.0000000  0.000000  0.000000   0.0000000
##   snape     14.002346  5.895725  0.7369656  3.684828  0.000000   0.7369656
##   umbridge 123.214085 22.631159 30.1748781 37.718598  0.000000   0.0000000
##   vernon     0.000000  0.000000  0.0000000  0.000000  0.000000  15.0874390
##           Docs
## Terms            189        194        196        81
##   —         0.000000   0.000000   0.000000  0.000000
##   ¨c       38.950445 165.539392 177.711406  0.000000
##   dobby     2.689660   0.000000   0.000000 59.172517
##   hagrid    0.724993   5.799944   0.000000 10.874894
##   lupin     0.000000   1.415037   8.490225  0.000000
##   sirius    0.000000   0.000000   7.101497  2.028999
##   slughorn  0.000000   0.000000   0.000000  0.000000
##   snape     0.000000   1.473931  78.855319  5.158759
##   umbridge  0.000000   0.000000   0.000000  0.000000
##   vernon    0.000000   0.000000   0.000000  2.514573

Preparation Work (cont.)

##                  V1        V2           V3           V4
## umbridge  20234.721  102.8372 2.220446e-16 5.849498e+01
## professor 11051.292  103.7483 1.337324e+03 2.362579e+03
## lupin     10190.473 4246.1728 2.127880e+03 2.220446e-16
## snape     10047.151 3759.9352 7.032213e+00 3.092425e+03
## hagrid     8475.255  705.3663 3.947349e+03 5.777697e+03
##                     V1        V2           V3           V4
## ¨c        2.220446e-16 42063.296 5.089918e-11 5.691338e-10
## griphook  1.018192e+01  8475.596 8.811202e+02 2.836854e+00
## eater     6.882399e+02  8426.354 2.003849e+02 1.257583e+03
## voldemort 6.814381e+02  7182.504 9.542275e+02 1.380955e+03
## kreacher  6.581217e+02  7055.402 3.277940e+03 1.351365e+03
##                   V1           V2       V3           V4
## vernon  2.220446e-16 8.319926e+01 16320.80 9.286498e-09
## uncle   8.879144e+01 8.458021e-13 13430.88 2.047613e+02
## dudley  4.120097e+01 1.197749e+01 12426.54 1.550563e-10
## petunia 2.673152e+01 4.371715e+02 10587.57 1.654380e-04
## mrs     1.514712e+02 1.543847e+03 10455.68 1.166612e+03
##                    V1           V2           V3        V4
## —        2.220446e-16 3.722575e-15 2.220446e-16 33665.324
## slughorn 3.707542e+01 3.896770e+02 8.329534e-12 15004.200
## bagman   2.602750e+01 3.562215e+00 4.700756e+01 10867.256
## winky    5.316458e+01 3.884784e+00 1.158944e+01  9951.494
## moody    2.805123e+02 2.878832e+02 3.147092e+03  9827.710

Preparation Work (cont.)

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Creating Output...

Preparation Work (cont.)

## Removing 11663 of 22659 terms (11663 of 136872 tokens) due to frequency 
## Your corpus now has 100 documents, 10996 terms and 125209 tokens.

To determine the best number of topics, focus on semantic coherence and the residuals. Semantic coherence measures how well a topic’s words hang together, computed from a conditional probability score over frequent words; it is essentially a measure of human interpretability. Thus, low residuals and high semantic coherence are preferred. In this case, the residuals take a sharp dive as k increases toward 10 and flatten out afterward. Thus, choose 10 as k.
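A sketch of how the diagnostics can be swept with stm’s searchK(); `out` is assumed to be the result of the prepDocuments() step shown above, and the candidate K values are illustrative:

```r
library(stm)

# Fit candidate topic counts and collect diagnostics (held-out likelihood,
# residuals, semantic coherence) for each K
k_search <- searchK(out$documents, out$vocab, K = c(5, 10, 15, 20))
plot(k_search)   # inspect residuals and semantic coherence per K
```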

From the plot below, we can see that the topic with the highest expected proportion is topic 2, which contains words like “harry”, “ron”, and “said”. (Of course; a better approach might be to eliminate all words related to the main characters, the Hogwarts Houses, etc. in the first place.) The topic expected to cover the second-largest proportion is topic 5, which again has the words “said”, “harry”, and “ron”. Moving through the ten topics, all contain “harry”, along with other main characters’ names like “hermione” and “dumbledore”!

Now, see what emerges from the topics in detail. FREX words are probably what we should focus on here, since they occur frequently within a topic and are exclusive to it. In topic 2, the FREX words are umbridge, parvati, cho, angelina, potions, katie, and dobby. The highest-probability words are also worth attention, since these are the words most likely to occur within the topic. Under topic 2, they are said, harry, ron, hermione, just, well, and professor. All fairly foreseeable.

## Topic 1 Top Words:
##       Highest Prob: harry, said, uncle, vernon, aunt, dudley, petunia 
##       FREX: marge, dudley, aunt, vernon, uncle, petunia, dudleys 
##       Lift: brutuss, horsy, mentally, pixies, sightless, venemous, waistband 
##       Score: marge, vernon, dudleys, petunia, figg, uncle, dursley 
## Topic 2 Top Words:
##       Highest Prob: said, harry, ron, hermione, just, well, professor 
##       FREX: umbridge, parvati, cho, angelina, potions, katie, dobby 
##       Lift: bezoar, hearthrug, losers, -one, accurate, afternoons, applied 
##       Score: katie, angelina, parvati, umbridge, kreacher, slughorn, luna 
## Topic 3 Top Words:
##       Highest Prob: harry, said, moody, ron, around, back, cedric 
##       FREX: krum, cedric, champions, moody, task, moodys, tournament 
##       Lift: -large, acceptance, bubbly, diamond, disqualified, fiber, flicker 
##       Score: moody, champions, krum, bagman, merpeople, karkaroff, enclosure 
## Topic 4 Top Words:
##       Highest Prob: said, harry, dumbledore, snape, now, back, looked 
##       FREX: fudge, dumbledore, slughorn, pensieve, memory, snape, morfin 
##       Lift: arabella, augustus, bites, blades, blaming, characteristic, cheerfulness 
##       Score: slughorn, morfin, fudge, snape, hepzibah, crouch, figg 
## Topic 5 Top Words:
##       Highest Prob: harry, said, ron, hermione, professor, back, lupin 
##       FREX: scabbers, wood, crookshanks, gryffindor, pettigrew, lupin, team 
##       Lift: axe, execution, flobberworms, malfay, nooo, prongs, sabotage 
##       Score: scabbers, hooch, lupin, flint, pettigrew, rat, blacks 
## Topic 6 Top Words:
##       Highest Prob: said, harry, weasley, ron, mrs, hermione, fred 
##       FREX: scrimgeour, weasley, mrs, weasleys, arthur, percy, charlie 
##       Lift: -heat, -tray, accommodate, accuse, adoringly, ancestors, announcement 
##       Score: scrimgeour, tents, roberts, kreacher, bagman, crouch, charlie 
## Topic 7 Top Words:
##       Highest Prob: harry, said, hermione, ron, dumbledore, know, wand 
##       FREX: xenophilius, hallows, grindelwald, deathly, godrics, lily, kreacher 
##       Lift: deathstick, erumpent, gellert, hallows, ¨cit, ¨cof, againso 
##       Score: xenophilius, hallows, grindelwald, kreacher, griphook, deathly, regulus 
## Topic 8 Top Words:
##       Highest Prob: harry, said, hagrid, hermione, ron, back, yeh 
##       FREX: yeh, ter, yer, hagrid, don, rita, fer 
##       Lift: bidin, breeds, couldn, crinkled, croakily, crocodile-skin, determinedly 
##       Score: hagrid, yeh, yer, bagman, rita, ollivander, ter 
## Topic 9 Top Words:
##       Highest Prob: harry, said, wand, hermione, ron, back, around 
##       FREX: bellatrix, greyback, prophecy, cattermole, diary, travers, riddle 
##       Lift: ¨cstop, arch, battling, bawled, betrayal, black-hooded, blistered 
##       Score: griphook, travers, greyback, bellatrix, bellatrixs, dais, cattermole 
## Topic 10 Top Words:
##       Highest Prob: harry, said, back, hagrid, ron, hermione, looked 
##       FREX: diadem, tonks, nick, kingsley, motorbike, mad-eye, crabbe 
##       Lift: better-looking, carrows, dedaluss, dizzying, hagridharry, motorbike, whereve 
##       Score: diadem, motorbike, kingsley, tonks, luna, dedalus, bike

Now, see which passages from the books have higher probabilities of being associated with each topic. The result of this step is a LONG excerpt from the books (unsurprisingly), so the output is hidden here.

Next, since topic models are probabilistic, compute the probability of each document belonging to each topic.

##               [,1]         [,2]         [,3]         [,4]         [,5]
##  [1,] 0.2730288220 4.266334e-04 3.724610e-04 2.707508e-05 0.0008390219
##  [2,] 0.0001049162 1.810393e-04 1.498009e-04 8.360836e-06 0.0002413514
##  [3,] 0.0094009254 2.739393e-04 2.218427e-04 4.836820e-04 0.7071116815
##  [4,] 0.0002889296 1.472327e-04 2.210055e-05 9.385872e-05 0.9986132588
##  [5,] 0.0001760840 1.484458e-04 2.218044e-05 9.395180e-05 0.9987215784
##  [6,] 0.0015030774 7.539111e-04 8.299145e-04 2.074743e-04 0.1622878153
##  [7,] 0.0042489279 2.258018e-04 2.886410e-04 1.699895e-04 0.0006834611
##  [8,] 0.6935701057 4.590232e-04 1.434213e-04 1.367515e-04 0.3037400590
##  [9,] 0.0004164102 6.869285e-04 2.069677e-04 6.839662e-05 0.4600250960
## [10,] 0.0001110533 9.973328e-01 2.689690e-04 8.800624e-04 0.0005345626
## [11,] 0.0008652314 5.908224e-01 6.753583e-04 3.614958e-03 0.2198512988
## [12,] 0.0005260726 1.712185e-03 4.997754e-01 5.264215e-04 0.0018317711
## [13,] 0.0042321812 1.333026e-03 3.851515e-04 2.028674e-01 0.2160993806
## [14,] 0.9988696781 1.063736e-04 1.211603e-04 5.682283e-05 0.0003356242
## [15,] 0.0096753282 8.189861e-05 4.409979e-05 3.660200e-04 0.1658359967
##               [,6]         [,7]         [,8]         [,9]        [,10]
##  [1,] 0.0003900969 4.134061e-05 0.7246271312 4.116701e-06 2.433012e-04
##  [2,] 0.0001773218 1.442521e-05 0.9990761915 4.859815e-07 4.610687e-05
##  [3,] 0.0005136848 2.802837e-01 0.0009092437 4.088504e-04 3.924747e-04
##  [4,] 0.0001080784 1.070262e-04 0.0003396418 2.201595e-04 5.971372e-05
##  [5,] 0.0001018846 1.230228e-04 0.0003491499 2.168536e-04 4.684855e-05
##  [6,] 0.0361072470 2.726899e-04 0.7976254325 1.193374e-04 2.931006e-04
##  [7,] 0.8318028442 1.506552e-04 0.1621862564 5.096658e-05 1.924564e-04
##  [8,] 0.0002922679 1.496150e-04 0.0010415188 2.707309e-04 1.965067e-04
##  [9,] 0.0005444883 1.207367e-04 0.5377455360 3.002840e-05 1.554116e-04
## [10,] 0.0000368077 5.040847e-05 0.0004952048 2.087689e-04 8.135646e-05
## [11,] 0.1309241545 4.100243e-04 0.0465740086 5.036496e-03 1.226048e-03
## [12,] 0.0011369750 5.922394e-04 0.0004289024 4.928849e-01 5.851253e-04
## [13,] 0.0017248717 3.829752e-04 0.0001493163 5.722928e-01 5.328308e-04
## [14,] 0.0001308991 4.428319e-05 0.0001729407 5.780201e-05 1.044160e-04
## [15,] 0.8219002720 3.121304e-04 0.0013000300 3.133676e-04 1.708567e-04
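A matrix like the one above lives in the fitted stm model’s theta slot (documents in rows, topics in columns, each row summing to 1); the model name `hp_stm` would be an assumption. Its structure can be illustrated with a toy stand-in:

```r
# Toy stand-in for hp_stm$theta: 2 documents x 3 topics
theta <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1), nrow = 2, byrow = TRUE)

apply(theta, 1, which.max)  # most likely topic per document: 1, then 2
rowSums(theta)              # each row is a probability distribution: 1, 1
```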

Image Detection

To see how image detection handles some of the most famous scenes from the Harry Potter movies, I picked the following three pictures to experiment on.

Buckbeak’s Flight

–from Harry Potter and the Prisoner of Azkaban



Lakeside, seashore, and volcano are among the highest-probability labels – not bad!

## $file
## [1] "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp1.jpg"
## 
## $type
##        label probability
## 1   lakeside  0.16771728
## 2   seashore  0.13642623
## 3    sandbar  0.11876366
## 4 promontory  0.09261914
## 5    volcano  0.04462162

Setting the threshold to 0.1 here works out pretty well!



Voldemort Vs. Dumbledore

– from Harry Potter and the Order of the Phoenix



To be fair, those sparkles really look like a barn spider, caldron, fountain, or volcano. And a space shuttle… probably not that wrong in this case.

## $file
## [1] "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp2.jpg"
## 
## $type
##           label probability
## 1   barn spider  0.06893399
## 2       caldron  0.04882356
## 3      fountain  0.04150808
## 4       volcano  0.03183554
## 5 space shuttle  0.03058341

I tried setting the threshold to multiple values, but the results did not turn out well. Here is a prediction picture with the threshold set to 0.07. Since there is definitely neither a cow nor a car in this scene, could this have something to do with the dim light?



The Patronus

– from Harry Potter and the Prisoner of Azkaban



The image detection is not really working well here. Reasons to be discovered later on.

## $file
## [1] "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp3.jpg"
## 
## $type
##          label probability
## 1     ice bear  0.03816724
## 2 killer whale  0.02890317
## 3       valley  0.02306473
## 4 electric ray  0.02027919
## 5  tiger shark  0.01995512

Again, detection on this image is not working very well, probably because of the low light and dark tone. Below is the prediction image with the threshold set to 0.15.



Takeaway & Looking Forward

While sentiment analysis on single words may be useful in some scenarios, that is not likely the case for literature, at least not for the Harry Potter series. Through single-word sentiment analysis, negativity dominates in all seven books. However, once negation words are taken into account by analyzing bigrams, we can see that the single-word results might be misleading. Sentiment analysis on whole sentences makes this flaw even more pronounced.

Also, comparing the results of sentiment analysis based on the AFINN, bing, and nrc lexicons, we can see that they are not necessarily identical. The differences are rooted in the logic behind each lexicon.

Thus, the accuracy of sentiment analysis largely depends on the methodology chosen, and everything should be done in context. Further investigation of the relationship between analyses on unigrams, larger units of text, and whole sentences is needed.

As for the topic modeling and image detection - I’ve had fun and want to discover more image detection technology used on pictures with darker tones and lower light.

Lastly, since I created my own stop_words dictionary to eliminate expectedly frequent words like the main characters’ names, the Hogwarts House names, etc., I want to find better ways to eliminate terms like “harry’s” as well.
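One way to do this is to extend tidytext’s stop_words with a custom lexicon that includes the possessive forms, so the same anti_join() catches them; the word list here is illustrative:

```r
library(dplyr)
library(tidytext)

# Series-specific stop words, including possessives like "harry's"
custom_stop <- tibble(
  word = c("harry", "harry's", "ron", "hermione", "gryffindor"),
  lexicon = "custom"
)
all_stop <- bind_rows(stop_words, custom_stop)

# tokens %>% anti_join(all_stop, by = "word") would then drop all of these
```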

To be continued.

Katherine Li